Audio Alchemy

INFO 523 - Final Project

Project description
Author
Affiliation

Nathan Herling & Yashi Mi

College of Information Science, University of Arizona

Python libraries

Abstract

Music recommendation systems increasingly rely on machine learning to capture the complexity of user preferences, yet existing models struggle to account for language diversity and nuanced audio features in songs. This project applies signal processing, vocal separation (DEMUCS library), and machine learning techniques to develop a framework for classifying both music genres and song languages, integrating these predictions with genre metadata for improved personalization. By combining automated data collection with advanced audio analysis, the system provides a foundation for smarter, more inclusive recommendation platforms that enhance user experience across diverse musical contexts. The project focused on two tasks: language and genre recognition. For language recognition, classical models—including Logistic Regression, Random Forests, and SVMs—were trained on extracted statistical and time-frequency features using 5-fold cross-validation. These models showed modest predictive performance, with accuracy, precision, recall, and F1-scores generally ranging from 10–60%, while vocal features provided stronger signals than instrumental components. Next, KNNs and Random Forests were applied with ‘genre’ as the target variable. Finally, CNNs were applied to Mel spectrogram images—both grayscale and color scale—with train/validation/test splits, early stopping, and hyperparameter sweeps to capture complex audio patterns. While all models had limited performance, CNNs improved recognition compared to classical models and have strong theoretical potential, as reported in the literature, highlighting the promise of deep learning and feature engineering for future music recommendation and language identification systems.

Introduction

Music genre classification is a central task in the field of music information retrieval, combining elements of signal processing, machine learning, and deep learning. Accurate genre identification not only enhances music recommendation systems and streaming platforms but also deepens our understanding of audio structure and human perception of sound. Traditional approaches have relied on handcrafted audio features analyzed with machine learning techniques such as Random Forests and Gaussian Mixture Models, offering interpretable yet limited performance [1]. Recent advances, however, leverage deep learning methods—particularly convolutional neural networks (CNNs)—to extract high-level representations directly from spectrograms, achieving state-of-the-art results [2]. This project explores both paradigms: first applying classical machine learning with 5-fold cross-validation, and then advancing to CNN-based classification on spectrogram heat maps, with results evaluated using standard metrics including accuracy, precision, recall, F1-score, confusion matrices, and ROC curves.

A note on spectrographic features of .mp3 vs. .wav

Initially, it was assumed that the typical size difference between .mp3 and .wav files would reflect meaningful differences in their spectral properties. However, analysis revealed that the differences were minimal, leading us to abandon the idea of using the two file types as comparative baselines for our models. As shown in Figure 1, the typical similarity between .wav and .mp3 files exceeds 98%, rendering the expectation of differing training results between the two data types a moot point.

MP3 vs WAV

Figure 1: Frequency histogram comparison between MP3 and WAV audio files.
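The comparison behind Figure 1 can be sketched as follows. This is a minimal, self-contained illustration using NumPy only: a synthetic tone stands in for a decoded .wav file, and a crude spectral truncation stands in for MP3 compression (the actual analysis compared real decoded files, so the exact numbers here are illustrative).

```python
import numpy as np

def freq_histogram(y: np.ndarray, sr: int, n_bins: int = 64) -> np.ndarray:
    """Magnitude-weighted frequency histogram, normalized to sum to 1."""
    spectrum = np.abs(np.fft.rfft(y))
    freqs = np.fft.rfftfreq(len(y), d=1.0 / sr)
    hist, _ = np.histogram(freqs, bins=n_bins, weights=spectrum)
    return hist / hist.sum()

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sr = 22050
t = np.linspace(0, 1.0, sr, endpoint=False)
# Synthetic stand-in for a decoded .wav: a two-note chord plus a little noise.
wav = (np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)
       + 0.01 * np.random.default_rng(0).standard_normal(sr))

# Crude stand-in for MP3 compression: discard the top ~30% of frequency bins.
spec = np.fft.rfft(wav)
spec[int(len(spec) * 0.7):] = 0
mp3_like = np.fft.irfft(spec, n=len(wav))

sim = cosine_similarity(freq_histogram(wav, sr), freq_histogram(mp3_like, sr))
print(f"Histogram similarity: {sim:.3f}")  # high, in the spirit of Figure 1
```

Because most musical energy sits well below the discarded band, the two histograms remain nearly identical, which is the same effect that made the .wav/.mp3 comparison a moot point.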

A note on project software

It’s generally easier and faster to run the Python scripts separately and incorporate their results into the discussion. All scripts used are stored in:
Herling-Mi_extra\0_mL_scripts\0_p1
Herling-Mi_extra\0_mL_scripts\0_p2

Questions

1. Language Recognition with Separated Vocal & Audio Tracks

Initial Problem Formulation

How can we leverage statistical and time-frequency features extracted from separated vocal and audio tracks to build effective language recognition models? Specifically, how can traditional machine learning methods — ranging from classical classifiers on simple statistical summaries to Gaussian Mixture Models on richer time-frequency features — be applied in this context?

  • What are the key benefits and limitations of these approaches?
  • How can careful feature engineering, feature integration, and thorough model evaluation improve the accuracy and robustness of language recognition systems?
  • How do model results compare and contrast when using .wav files versus .mp3 files?

Secondary Problem Formulation

From the initial formulation, we refined the question to specifically compare how different ablations of the audio track (complete song, vocal-only, and non-vocal) affect model performance.

  • How does model performance differ when predicting song language using features from complete songs, vocal-only tracks, and instrumental-only tracks?

  • What are the relative strengths and limitations of classical machine learning models (Logistic Regression, Random Forest, SVM) when applied to language recognition?

2. Recommendation Systems Using Audio Features & User Data

Initial Problem Formulation

How can user interaction data, combined with basic track metadata and simple audio features, be used to build an effective recommendation system using collaborative filtering and traditional machine learning methods?

  • Furthermore, how can advanced audio features, dimensionality reduction, and clustering techniques improve personalized recommendations by better capturing user preferences and track characteristics from both vocal and non-vocal components?
  • How do recommendation model results compare and contrast when using .wav files versus .mp3 files, considering the potential impact of audio quality and compression artifacts on feature extraction and recommendation performance?

Secondary Problem Formulation

We abandoned the use of .wav versus .mp3 formats for the reasons previously mentioned. Instead, the idea of using heat maps/spectrograms was discovered and pursued. A CNN was built for both grayscale and viridis-scale inputs. As will be discussed in the Problem Analysis and Results section, training outcomes were poor despite what initially seemed to be a solid approach. The current hypothesis is that the dataset is too small to produce a robust model, that the extracted song metrics are insufficient to support effective training, or a combination of both. The Likert scale—with options of ‘Likert 2,’ ‘Likert 3,’ and ‘Likert 5’—was not employed, as it represents a second phase of recommendation by genre, contingent on reliable genre recognition.

Dataset

Data provenance

The data collection process involved several custom Python scripts designed to scrape and download the necessary information and audio files:

artist_5_song_list_scrape.py — Retrieves the top five songs per artist from Google search results.

artist_genre_scrape.py — Gathers genre metadata for each artist from public sources.

artist_country_of_origin_scrape.py — Extracts the country of origin for each artist.

audio_scrape_wav_mp3.py — Downloads audio files from YouTube links in WAV and MP3 formats.

Together, these scripts automate the extraction of both audio data and relevant metadata to support training and evaluation of the recommendation system.

For question 1
A total of 123 songs were scraped, and each was rendered in triplicate: (1) complete song, (2) audio (instrumental) only, and (3) vocal only, yielding 369 observations to work with. The complete extraction pipeline for question 1 took around 15 hours.

For question 2
A total of 20 genres were examined, each with 10 example songs from relevant artists, yielding a set of 200 observations to work with. The complete extraction pipeline for question 2 took around 5 hours, thanks to the use of parallel threads in a reconfigured software file.

Software distribution

Initially, the plan was to distribute a software package to both partners so they could each collect their song files and extract the data locally. However, due to the ambitious goals and the multifaceted software requirements needed to accomplish them, the team soon felt as if we were flying the plane while building it. Ultimately, one team member (Nathan) took responsibility for collecting the song file data and generating the features. These features were then distributed to the other team member, replacing the originally envisioned feature-extraction software suite.

Data features

In addition to the artist and song name, the features listed in Table 1 were scraped for each track. The final selection of features was guided as much by curiosity—‘I wonder what this will do’—as by deliberate planning. Research was done, but until you try to build the model yourself, you are not aware of what works under which conditions.

🔍 Feature Scraping

Feature Description
fundamental_freq Fundamental frequency (mean pitch via librosa.pyin)
freq_e_1 Dominant spectral energy #1 (highest energy frequency bin)
freq_e_2 Dominant spectral energy #2 (2nd highest energy frequency bin)
freq_e_3 Dominant spectral energy #3 (3rd highest energy frequency bin)
key Estimated musical key (C, C#, D, …, B) via chroma features
duration Length of audio in seconds
zero_crossing_rate Average zero crossing rate (signal sign changes)
mfcc_mean Mean of 13 MFCC coefficients (timbre features)
mfcc_std Standard deviation of MFCC coefficients
tempo Estimated tempo in beats per minute (BPM)
rms_energy Root mean square energy (loudness measure)
track_type Audio track type (0=full mix, 1=vocal only, 2=no vocals)
mel_spectrogram Mel-scaled spectrogram representing frequency content over time (human hearing range)

Table 1: Extracted Audio Features
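Several of the simpler features in Table 1 can be computed directly with NumPy; the sketch below illustrates zero_crossing_rate and rms_energy on a synthetic tone. (The project itself extracted these via librosa, e.g. librosa.feature.zero_crossing_rate and librosa.feature.rms, which work frame-by-frame rather than over the whole signal.)

```python
import numpy as np

def zero_crossing_rate(y: np.ndarray) -> float:
    """Fraction of adjacent sample pairs where the signal changes sign."""
    return float(np.mean(np.signbit(y[:-1]) != np.signbit(y[1:])))

def rms_energy(y: np.ndarray) -> float:
    """Root-mean-square energy over the whole signal (a loudness proxy)."""
    return float(np.sqrt(np.mean(y ** 2)))

sr = 22050
t = np.linspace(0, 1.0, sr, endpoint=False)
y = np.sin(2 * np.pi * 440 * t)  # a pure 440 Hz tone, one second long

# A 440 Hz sine crosses zero 880 times per second -> ZCR ~ 880/22050 ~ 0.04
print(f"zero_crossing_rate: {zero_crossing_rate(y):.4f}")
# A unit-amplitude sine has RMS amplitude 1/sqrt(2) ~ 0.7071
print(f"rms_energy: {rms_energy(y):.4f}")
```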

Data storage

This JSON schema, as proposed in the original plan, was refactored as needed. The multiple pipeline components required to gather and merge the information made it more expedient to use a combination of the .json design and .csv files. Shown below is the main .json design used for the project.

{
  "AudioFile": {
    "yt_link": [],
    "wav_link": [],
    "mp3_link": []
  },
  "Region": ["America"],
  "Data": {
    "Artist": "Lady Gaga",
    "Song Title": "",
    "Genre": [],
    "Mean (of features)": null,
    "Variance": null,
    "Skewness": null,
    "Kurtosis": null,
    "Zero Crossing Rate": null,
    "RMS Energy": null,
    "Loudness": null,
    "Energy": null,
    "Tempo": null,
    "Danceability": null,
    "Key / Key Name": "",
    "Mode / Mode Name": "",
    "Mel-Spectrogram": null,
    "Duration (ms)": null
  }
}

Data - Second Tier Features

After the initial .wav data collection, the following '2nd tier' data were generated/collected:

  • .mp3
  • Separated vocal track
  • Separated audio track
  • Spectrogram data (viridis scale)
  • Spectrogram data (grey scale)
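Spectrogram image generation in both color maps can be sketched roughly as below. This is an assumption-laden stand-in, not the project's actual script: a SciPy linear-frequency spectrogram of a synthetic chirp replaces the Mel spectrogram (librosa.feature.melspectrogram) of a decoded song, and the output filenames are hypothetical.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render straight to files; no display needed
import matplotlib.pyplot as plt
from scipy import signal

sr = 22050
t = np.linspace(0, 2.0, 2 * sr, endpoint=False)
y = signal.chirp(t, f0=200, t1=2.0, f1=4000)  # rising tone as a stand-in song

# Linear-frequency power spectrogram, converted to dB for display.
f, times, Sxx = signal.spectrogram(y, fs=sr, nperseg=1024)
Sxx_db = 10 * np.log10(Sxx + 1e-10)

# Save one image per color map; filenames are hypothetical.
for cmap, fname in [("gray", "spectrogram_grey.png"),
                    ("viridis", "spectrogram_viridis.png")]:
    plt.figure(figsize=(4, 3))
    plt.pcolormesh(times, f, Sxx_db, cmap=cmap, shading="auto")
    plt.axis("off")  # the CNN sees pixels only, so drop axes and labels
    plt.savefig(fname, bbox_inches="tight", pad_inches=0)
    plt.close()
```

Stripping the axes matters: any tick marks or labels baked into the image would become spurious, genre-independent features for the CNN.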

The Demucs library turned out to be easy to use and very good at vocal and background track separation. Scripts that run Demucs via system commands—typically through Python’s subprocess or os libraries—offer a straightforward way to integrate audio separation tools into Python workflows while interacting with the operating system’s file structure and command-line utilities. To run files in parallel, a main_script.py capable of spawning multiple threads would call a worker_script.py such as the one below.

Below are two representative spectrogram feature extractions.

Image 1: Queen_we_will_rock_you - spectrogram - grey scale

Image 2: Billie Eilish - bad guy - spectrogram - viridis scale

import os
import subprocess

# Path to your input audio file ("~" is expanded to the user's home directory)
audio_file = os.path.expanduser(r"~\Gloria_Gaynor_I_Will_Survive.wav")

# Optional: check if the file exists
if not os.path.exists(audio_file):
    raise FileNotFoundError(f"Audio file not found: {audio_file}")

# Build the Demucs command
# You can change --two-stems to 'drums' or 'bass' if needed
command = [
    "demucs",
    "--two-stems=vocals",      # Extract vocals only
    "--out", "demucs_output",  # Output folder
    audio_file,
]

# Run the command; check=True raises CalledProcessError if Demucs fails
print("🔄 Running Demucs...")
subprocess.run(command, check=True)

print("✅ Separation complete. Check the 'demucs_output' folder for results.")
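A dispatching main_script.py of the kind described above could be sketched as follows. The worker here is a hypothetical stand-in invoked via `python -c`; in the actual pipeline each thread would launch worker_script.py on one audio file.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

def run_worker(audio_file: str) -> str:
    """Launch one worker process for one audio file and return its output."""
    # Hypothetical stand-in for worker_script.py: it just echoes its argument.
    result = subprocess.run(
        [sys.executable, "-c",
         "import sys; print('processed ' + sys.argv[1])", audio_file],
        capture_output=True, text=True, check=True)
    return result.stdout.strip()

audio_files = ["song_a.wav", "song_b.wav", "song_c.wav"]  # placeholder names

# Threads are enough here: the heavy lifting happens in the child processes.
with ThreadPoolExecutor(max_workers=3) as pool:
    outputs = list(pool.map(run_worker, audio_files))

for line in outputs:
    print(line)
```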

Team member workload

Our project workload followed a structured week-by-week workflow as proposed in the initial proposal, with responsibilities distributed among team members. We began by finalizing and sharing the proposal, followed by the individual collection and organization of ~200 audio files per person. Nathan Herling led the processing and validation of metadata, while each member focused on building machine learning pipelines and conducting iterative testing. The project concluded with a collaborative effort on final model evaluation, report preparation, and presentation development.

Problem analysis and results

General

The easy and medium paths proposed in the proposal morphed into multiple paths to tackle the problem, some bearing fruit, some not. In Q1, it could be argued that three easy paths were taken in an attempt to explore which might work better; in Q2, two easy paths and a medium path were explored.

Q1 - Yashi

How can we leverage audio features from separated vocal and instrumental tracks to improve language recognition in music?

Data Collection: The dataset consisted of ~200 audio files, preprocessed into three ablations: complete songs, vocal-only tracks, and instrumental-only tracks. Features included time-domain statistics (mean, variance, skewness, kurtosis).

Data Processing: All features were standardized using global scaling. The target variable (language) was encoded with LabelEncoder. No major imputation was required, as missingness was minimal.

Model Selection: I evaluated three models: Logistic Regression, Random Forest, and Support Vector Machines with linear kernels. These models were chosen for their balance of interpretability, robustness, and suitability for structured feature data. Training and evaluation were conducted using 5-fold stratified cross-validation to ensure reliable performance comparisons across models.

Validation & Metrics: Evaluation focused on accuracy, precision, recall, and F1-score. Confusion matrices were used to analyze per-class misclassification patterns.
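The training and evaluation loop described above can be sketched with scikit-learn as follows; the synthetic data stands in for the extracted statistical features and language labels, so the scores it prints are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the extracted statistical features and labels.
X, y = make_classification(n_samples=200, n_features=13, n_informative=6,
                           n_classes=3, random_state=42)

models = {
    "LogReg": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVM_linear": SVC(kernel="linear"),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for name, model in models.items():
    # Scaling lives inside the pipeline so each fold is scaled independently.
    pipe = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipe, X, y, cv=cv, scoring="f1_macro")
    print(f"{name}: macro F1 = {scores.mean():.3f} +/- {scores.std():.3f}")
```

Putting the scaler inside the pipeline avoids leaking test-fold statistics into training, which matters when comparing models this closely.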

Model Evaluation:

Ablation Model Accuracy Precision Recall F1
complete_song LogReg 0.399 0.398 0.447 0.387
complete_song RandomForest 0.626 0.469 0.410 0.401
complete_song SVM_linear 0.432 0.443 0.504 0.427
vocal_only LogReg 0.560 0.531 0.567 0.509
vocal_only RandomForest 0.552 0.426 0.404 0.385
vocal_only SVM_linear 0.544 0.542 0.583 0.514
no_vocal LogReg 0.333 0.371 0.364 0.316
no_vocal RandomForest 0.577 0.436 0.349 0.328
no_vocal SVM_linear 0.366 0.418 0.411 0.347
Column Min - 0.333 0.371 0.349 0.316
Column Max - 0.626 0.542 0.583 0.514

Results:

  • Vocal-only tracks: Provided the best classification signal, with SVM achieving ~0.51 macro F1, outperforming Random Forest and Logistic Regression.

  • Complete songs: Models achieved moderate performance (~0.40 F1), reflecting a mixture of useful vocal cues diluted by instrumental content.

  • Instrumental-only tracks: Performance dropped to the lowest levels (macro F1 ~0.32, approaching the random baseline), validating the expectation that language recognition requires vocal content.

  • Future recommendations: larger datasets and more extensive hyperparameter exploration.

Q2

How can we leverage audio features to construct a machine learning model capable of genre recognition?

  • Data Collection:
    Data collection was performed with Python scripts for each feature listed in Table 1. Ten genres were chosen, along with twenty representative artists for each genre, both selected via Google search.
  • Data Processing:
    All data was present; no imputation was needed. We decided not to eliminate statistical outliers, since the model design had not been explored thoroughly enough to warrant selecting out data.
  • Model Selection:
    Three supervised machine learning methods were chosen: (1) KNN (a classic baseline), (2) Random Forest (with the hope of good baseline results), and (3) CNN, applied to both the numerical dataset and the extracted spectrogram files.
  • Model Validation:

(1) KNN

  • LOOCV: Validates and selects the best hyperparameters.
  • 5-Fold CV learning curve: Validates generalization performance as a function of training size.

(2) Random Forest

  • Cross validation
  • Learning curve

(3) CNN

  • Early stopping

Model Metrics

(1) KNN - hyperparameter sweep:

  • n_neighbors: [1, 3, 5, 7, 9]
  • weights: ['uniform', 'distance']
  • metric: ['euclidean', 'manhattan']
  • p: [1, 2] (only relevant if metric='minkowski')
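A LOOCV-driven sweep over a grid like this can be sketched with GridSearchCV and LeaveOneOut (the `p` parameter is omitted below since, as noted, it only matters for the minkowski metric); the data is a small synthetic stand-in for the Table 1 feature matrix.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

# Small synthetic stand-in for the Table 1 feature matrix and genre labels.
X, y = make_classification(n_samples=60, n_features=13, n_informative=6,
                           n_classes=3, random_state=0)

param_grid = {
    "n_neighbors": [1, 3, 5, 7, 9],
    "weights": ["uniform", "distance"],
    "metric": ["euclidean", "manhattan"],
}

# LeaveOneOut is LOOCV: every sample is held out exactly once.
search = GridSearchCV(KNeighborsClassifier(), param_grid,
                      cv=LeaveOneOut(), scoring="accuracy")
search.fit(X, y)

print("Best params:", search.best_params_)
print(f"LOOCV accuracy: {search.best_score_:.3f}")
```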

(2) Random Forest - hyperparameter sweep:

  • n_estimators: [10, 50, 100, 200]
  • max_depth: [None, 5, 10, 15]
  • min_samples_split: [2, 5, 10]
  • min_samples_leaf: [1, 2, 4]
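The learning curves discussed in the results can be generated with scikit-learn's learning_curve helper; the sketch below uses synthetic data as a stand-in for the genre feature matrix.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the genre feature matrix and labels.
X, y = make_classification(n_samples=200, n_features=13, n_informative=6,
                           n_classes=4, random_state=1)

sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=100, max_depth=10, random_state=1),
    X, y, cv=5, train_sizes=np.linspace(0.2, 1.0, 5), scoring="accuracy")

for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={int(n):3d}  train={tr:.2f}  val={va:.2f}")
# A persistent train/validation gap like the one reported signals overfitting.
```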

(3) CNN - hyperparameter sweep:

  • conv_filters: [[32, 64], [64, 128], [32, 64, 128]]
  • kernel_size: [(3,3), (5,5)]
  • dropout: [0.2, 0.3, 0.5]
  • learning_rate: [0.01, 0.001, 0.0001]

Metrics and Results (KNN)

Hyperparameter Sweep

Hyperparameter Sweep - KNN

Learning Curve 2

Learning Curve - KNN

Best Hyperparameters
knn__algorithm auto
knn__n_neighbors 15
knn__p 1
knn__weights uniform

Metric Score
Test Precision (weighted) 0.1017
Test F1 Score (weighted) 0.1117
LOOCV Accuracy 0.1350

Class Precision Recall F1-Score Support
0 0.00 0.00 0.00 4
1 0.25 0.25 0.25 4
2 0.20 0.25 0.22 4
3 0.17 0.25 0.20 4
4 0.00 0.00 0.00 4
5 0.00 0.00 0.00 4
6 0.00 0.00 0.00 4
7 0.20 0.25 0.22 4
8 0.20 0.25 0.22 4
9 0.00 0.00 0.00 4
Accuracy 0.12
Macro Avg 0.11
Weighted Avg 0.11

KNN Evaluation
This learning curve reveals a significant gap between training and cross-validation performance for the KNN classifier:

🔵 Training Score: The model achieves a perfect F1 score of 1.0 across all training set sizes, which is a strong indicator of overfitting—the model memorizes the training data rather than generalizing from it.

🟢 Cross-Validation Score: Starts near 0.0 and only climbs to about 0.2 even with 160 training samples. This suggests the model struggles to generalize and perform well on unseen data.

📉 Implication: Despite using the best hyperparameters, the model may be too sensitive to noise or lacks sufficient complexity to capture meaningful patterns. KNN’s reliance on local structure might be failing due to sparse or high-dimensional data.

Metrics and Results (Random Forest)

Hyperparameter Sweep

Hyperparameter Sweep - RF

Learning Curve 2

Learning Curve - RF

Metric Value
max_depth 10.0
min_samples_leaf 1.0
min_samples_split 2.0
n_estimators 200.0
accuracy 0.925
precision 0.927554
recall 0.925
f1_score 0.924728

Random Forest Analysis
This random forest model appears to be overtrained, and here’s why:

🔍 Key Indicators of Overtraining

  • Training Accuracy = 1.0 across all training set sizes:

    This suggests the model is memorizing the training data perfectly, which is a classic sign of overfitting.

  • Validation Accuracy starts low (~0.1) and rises to ~0.85:

    While the validation accuracy improves with more data, the persistent gap between training and validation accuracy indicates poor generalization early on.

  • Even at the largest training size, the model still performs significantly worse on unseen data than on training data.

📈 What a Healthy Learning Curve Might Look Like

  • Training accuracy should decrease slightly as training size increases (less memorization).
  • Validation accuracy should increase and converge toward training accuracy.
  • A smaller gap between the two curves suggests better generalization.

🧠 Why Random Forests Can Overfit

  • If the number of trees is too high or if each tree is allowed to grow too deep, the ensemble can overfit.
  • Especially with small datasets, random forests can memorize patterns that don’t generalize.

Metrics and Results (CNN - Grey scale)

Hyperparameter Sweep

Hyperparameter Sweep CNN - grey scale

Learning Curve 2

Learning Curve CNN - grey scale

Conv Layers Epochs Patience Accuracy F1 Precision
2 10 2 0.0750 0.0143 0.0079
2 10 5 0.1000 0.0229 0.0129
3 10 2 0.1000 0.0182 0.0100
3 10 5 0.1000 0.0182 0.0100
2 15 2 0.1000 0.0186 0.0103
2 15 5 0.1000 0.0186 0.0103
3 15 2 0.1250 0.0450 0.0361
3 15 5 0.1000 0.0182 0.0100
2 30 2 0.1000 0.0182 0.0100
2 30 5 0.2000 0.0750 0.0476
3 30 2 0.1000 0.0182 0.0100
3 30 5 0.1000 0.0182 0.0100

CNN Model Assessment - Grayscale Data

🚨 Red Flags in the Learning Curve

  • Training Accuracy rises to 1.0 by epoch 5: The model is perfectly memorizing the training data.
  • Validation Accuracy stays flat at ~0.2: The model is not generalizing at all. It is essentially guessing on unseen data.

🔍 Possible Causes

  • Data Issues:
    • Grayscale input might lack sufficient contrast or features.
    • Labels could be noisy or mismatched.
  • Model Complexity: The CNN might be too deep or have too many parameters for the dataset size.
  • Overtraining:
    • No regularization (e.g., dropout, weight decay).
    • No early stopping.

Note: All attempts to reduce overtraining did not work. It is postulated that the dataset needs to be larger for the CNN to learn meaningful patterns.

Metrics and Results (CNN - Color scale)

Hyperparameter Sweep

Hyperparameter Sweep - Color Spectrogram - CNN

Learning Curve 2

Learning Curve - Color Spectrogram - CNN

Final CNN Model Stats

  • Training Accuracy: 0.1937
  • Training Loss: 2.1730
  • Validation Accuracy: 0.1750
  • Validation Loss: 2.1234
  • Epoch 5: Early stopping triggered
  • Restoring model weights from the best epoch: 1

Final Best Model Metrics:

  • Accuracy: 0.1000
  • Precision: 0.0100
  • F1 Score: 0.0182

Color Spectrogram CNN Learning Curve

This graph shows a modestly improving CNN model trained on color spectrogram data, but it’s still underperforming overall.

📈 What the Learning Curve Shows:

  • Training Accuracy steadily increases from 0.0 to ~0.18 by epoch 4.
  • Validation Accuracy peaks at epoch 2 (~0.20), then slightly declines and flattens.

🧠 Interpretation:

  • The model is learning, but very slowly.
  • The validation peak at epoch 2 suggests the model briefly generalized well, but then started to overfit.
  • The low overall accuracy (max ~0.20) implies the model is struggling to extract meaningful features from the spectrograms.

🔍 Possible Issues:

  • Spectrogram preprocessing might be suboptimal (e.g., poor resolution, noisy input).
  • Model architecture may be too shallow or not well-tuned for this type of data.
  • Class imbalance or label noise could be limiting performance.
  • Too few epochs — the model might need more time to converge.

Note: Epoch 2 was the optimal epoch from the hyperparameter sweep. A postulated fix is to use a larger dataset to improve generalization and model performance.



Results & Conclusion

The primary goal of this project was to develop a machine learning system capable of recognizing both the language spoken in audio files and the musical genre, in order to enhance personalization in AI-driven music recommendation platforms. By accurately identifying spoken language within songs and combining this information with genre metadata, the system aims to suggest tracks that more closely align with individual user preferences. The challenge involved processing raw audio data, separating vocal from instrumental components, extracting meaningful statistical and time-frequency features, and applying both classical machine learning models and deep learning architectures to capture the underlying patterns in music.

To address these goals, the team experimented with a variety of models. Classical approaches—including Logistic Regression, Random Forests, and Support Vector Machines—were trained on extracted audio features using cross-validation, yielding modest predictive performance with accuracy, precision, recall, and F1-scores generally between 10–60%. K-Nearest Neighbors and Random Forests were applied for genre classification, while Convolutional Neural Networks were trained on Mel spectrogram images (grayscale and color), using early stopping and hyperparameter sweeps to optimize performance. Although the models demonstrated only limited overall accuracy, CNNs showed comparatively stronger results and align with literature on deep learning’s potential for audio analysis. Future improvements may include expanding the dataset size, refining spectrogram preprocessing, exploring deeper or more specialized architectures, and integrating more robust feature engineering to enhance both language and genre recognition for more effective recommendation systems.

Audio Player Demo

A demo of the UI/UX audio player written for this project. First, a few songs are scrolled through to demonstrate the functionality of ‘real time’ generation of dB vs. frequency curves for .wav and .mp3 files. Next, a song is played to demonstrate the ‘real time’ audio analysis with the spectrogram (heat map) feature.

Sources

[1] https://link.springer.com/chapter/10.1007/978-981-97-4533-3_6

[2] https://arxiv.org/html/2411.14474v1